Abstract

We present the new multi-threaded version of the state-of-the-art answer set solver clasp. We detail 
its component and communication architecture and illustrate how they support the principal func- 
tionalities of clasp. Also, we provide some insights into the data representation used for different 
constraint types handled by clasp. All this is accompanied by an extensive experimental analysis of 
the major features related to multi-threading in clasp. 

PUBLICATION NOTE: To appear in Theory and Practice of Logic Programming 



1 Introduction 

The increasing availability of multi-core technology offers a great opportunity for further
improving the performance of solvers for Answer Set Programming (ASP; (Baral 2003)). This
paper describes how we redesigned and reimplemented the award-winning ASP solver clasp
(Gebser et al. 2007b) in order to leverage the power of today's multi-core shared memory
machines by supporting parallel search. To this end, we chose a coarse-grained, task-parallel
approach via shared memory multi-threading. This has led to the clasp 2 series supporting a
single- and a multi-threaded variant sharing a common code base. clasp allows for parallel
solving by search space splitting and/or competing strategies. While the former involves
dynamic load balancing in view of highly irregular search spaces, both modes aim at running
searches as independently as possible in order to take advantage of enhanced sequential
algorithms. In fact, a portfolio of solver configurations can be used not only for competing
but also in splitting-based search. The latter is optionally combined with global restarts
to escape from uninformed initial splits.

For promoting the scalability of parallel search, all major routines of clasp 2 are lock- 
free. Also, we enforced a clear distinction between read-only, shared, and thread-local data 
and incorporated accordingly optimized representations. This is implemented by means 
of Intel's Threading Building Blocks (TBB) for providing platform-independent threads, 
atomics, and concurrent containers. Currently, clasp supports up to 64 configurable (non- 
hierarchic) threads. Apart from parallel search, another major extension of previous ver- 
sions of clasp regards the exchange of recorded nogoods. While unary, binary, and ternary 



* Affiliated with Simon Fraser University, Canada, and Griffith University, Australia. 

1 The multi-threaded variant of clasp 2 won the first place in the Crafted/UNSAT and the second place in the 
Crafted/SAT+UNSAT category, respectively, at the 2011 SAT competition in terms of number of solved in- 
stances and wall-clock time. In addition, clasp 2 was among the three genuine parallel solvers participating in 
the 32 cores track (restricted to benchmarks from the Application category; the fourth solver used a portfolio, 
including clasp 1.3). Also, clasp 2 participated "out of competition" at the 2011 ASP competition, which was 
dominated by the single-threaded variant of clasp 2. 
nogoods are always shared among all threads, longer ones can optionally be exchanged,
configurable at the sender as well as at the receiver side. In fact, clasp provides different 
measures estimating the quality of shared nogoods as well as various heuristics and filters 
for controlling their integration. For instance, the sharing of a nogood can be subject to the 
number of distinct decision levels associated with its literals. Conversely, the integration of 
a nogood may depend on its satisfaction and/or scores in host heuristics. 

In view of the wide distribution of clasp, we put a lot of effort into transferring the
entire functionality from the sequential, viz. clasp series 1.3, to the parallel setting. For
one, this concerned clasp's reasoning modes (cf. (Gebser et al. 2011a)), including
enumeration, projected enumeration, intersection and union of models, and optimization.
Moreover, we extended clasp's language capacities by allowing for solving weighted and/or
partial MaxSAT (Li and Manya 2009) as well as Boolean optimization (Marques-Silva et al.
2011) problems. Finally, it goes without saying that clasp's basic infrastructure has also
significantly evolved with the new design; e.g. the preprocessing capacities of clasp were
extended with blocked clause elimination (Jarvisalo et al. 2010), and its conflict analysis
has been significantly improved by on-the-fly subsumption (Han and Somenzi 2009).

In what follows, we focus on describing the multi-threaded variant of clasp 2. To this 
end, the next section provides a high-level view on modern parallel ASP solving. The gen- 
eral component and communication architecture of the new version of clasp are presented 
in Sections 3 and 4. Section 5 details the design of data structures underlying the
implementation of clasp 2. Parallel search features of clasp 2 are empirically assessed in
Section 6. Finally, Sections 7 and 8 discuss related work and the achieved results, respectively.



2 Parallel ASP Solving 

We presuppose some familiarity with search procedures for (Boolean) constraint solving,
that is, Davis-Putnam-Logemann-Loveland (DPLL; (Davis and Putnam 1960; Davis et al. 1962))
and Conflict-Driven Constraint Learning (CDCL; (Marques-Silva and Sakallah 1999; Zhang et
al. 2001)). In fact, (sequential) ASP solvers like smodels (Simons et al. 2002) adopt the
search pattern of DPLL based on systematic chronological backtracking, or like clasp (series
1.3) apply lookback techniques from CDCL, which include conflict-driven learning and
non-chronological backjumping. In what follows, we primarily concentrate on CDCL and
principal points for its parallelization in the clasp 2 series.

In order to solve the basic decision problem of solution existence, CDCL first extends a 
given (partial) assignment via deterministic (unit) propagation. Importantly, every derived 
literal is "forced" by some nogood (set of literals that must not jointly be assigned), which 
would be violated if the literal's complement were assigned. Although propagation aims 
at forgoing nogood violations, assigning a literal forced by one nogood may lead to the 
violation of another nogood; this situation is called conflict. If the conflict can be resolved 
(the violated nogood contains backtrackable literals), it is analyzed to identify a conflict 
constraint. The latter represents a "hidden" conflict reason that is recorded and guides 
backjumping to an earlier stage such that the complement of some formerly assigned literal 
is forced by the conflict constraint, thus triggering propagation. Only when propagation 
finishes without conflict, a (heuristically chosen) literal can be assigned at a new decision 
level, provided that the assignment at hand is partial, while a solution (total assignment 
while work available
    while no (result) message to send
        communicate      // exchange information with other solver instances
        propagate        // deterministically assign literals
        if no conflict then
            if all variables assigned then send solution
            else decide  // non-deterministically assign some literal
        else
            if root-level conflict then send unsatisfiable
            else if external conflict then send unsatisfiable
            else
                analyze  // analyze conflict and add conflict constraint
                backjump // unassign literals until conflict constraint is unit
    communicate          // exchange results with (and receive work from) other solver instances

Fig. 1. High-level algorithm for multi-threaded Conflict-Driven (Boolean) Constraint 
Learning. 



not violating any nogood) has been found otherwise. The eventual termination of CDCL
is guaranteed (cf. (Zhang and Malik 2003; Ryan 2004)), by either returning a solution or
encountering an unresolvable conflict (independent of unforced decision literals).
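
The propagation step described above can be illustrated by a small Python sketch. It is not clasp's implementation (clasp is written in C++ and relies on watched literals); the function name propagate and the encoding are assumptions made for this example only. Literals are non-zero integers, -l denotes the complement of l, a nogood is a tuple of literals, and an assignment is the set of literals currently true.

```python
def propagate(nogoods, assignment):
    """Extend `assignment` (a set of true literals) by unit propagation.

    Returns (assignment, conflict), where `conflict` is a violated nogood
    or None if propagation finished without conflict.
    """
    assignment = set(assignment)
    changed = True
    while changed:
        changed = False
        for nogood in nogoods:
            # A nogood with a false literal can no longer be violated.
            if any(-lit in assignment for lit in nogood):
                continue
            unassigned = [lit for lit in nogood if lit not in assignment]
            if not unassigned:
                # All literals are true: the nogood is violated (conflict).
                return assignment, nogood
            if len(unassigned) == 1:
                # All but one literal are true: the complement is forced.
                assignment.add(-unassigned[0])
                changed = True
    return assignment, None
```

For instance, with the nogoods (1, 2) and (-2, 3) and the initial assignment {1}, the first nogood forces -2, which in turn makes the second nogood force -3.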

Figure 1 provides a high-level view on the parallelization of CDCL-style search in clasp.
We first note that entering the inner search loop relies on the availability of work. In fact,
when search spaces to investigate in parallel are split up by means of guiding paths (Zhang
et al. 1996), a solver instance must acquire some spare guiding path before it can start to
search. In this case, all (decision) literals of the guiding path are assigned up to the solver's
root level, precluding them from becoming unassigned upon backtracking/backjumping.
Apart from search space splitting, parallelization of clasp can be based on algorithm
portfolios (Gomes and Selman 2001), running different solving strategies competitively on the
same search space. Once a solver instance is working on some search task, it combines
deterministic propagation with communication. The latter includes nogood exchange with
other solver instances, work requests from idle solvers (asking for a guiding path), and
external conflicts raised to abort the current search.² An external conflict or an (unresolvable)
root-level conflict likewise makes a solver instance stop its current search, and the same
applies when a solution is found. In such a case, the respective result is communicated (in the
last line of Figure 1), and a new search task may be received in turn.

As mentioned in the introductory section, the infrastructure of clasp also allows for 
conducting sophisticated reasoning modes like enumeration and optimization in parallel. 
This is accomplished via enriched message protocols, e.g. (upper) bounds are exchanged 
in addition to nogoods when performing parallel optimization, while an external conflict 
(raised upon finding the first solution) switches competing solvers of an algorithm portfolio 



2 For instance, a solver instance may discover unconditional unsatisfiability (even when using guiding paths; cf. 
(Ellguth et al. 2009)) and then inform others about the needlessness of performing further work. 
Fig. 2. Multi-threading architecture of clasp 2.



into enumeration mode based on guiding paths. In fact, search space splitting and algorithm 
portfolios can be applied exclusively or be combined to flexibly orchestrate parallel solvers. 

In the following sections, we detail the parallel architecture and underlying implementation
techniques of clasp 2. Regarding data structures, it is worthwhile to note that unit
propagation over "long" nogoods (involving more than three literals) relies on a
two-watched-literals approach (Moskewicz et al. 2001), monitoring two references to unassigned
literals for triggering propagation once the second-to-last literal becomes assigned. We also
presuppose basic familiarity with parallel computing concepts, such as race conditions, atomic
operations, (dead- and spin-) locks, semaphores, etc. (cf. (Herlihy and Shavit 2008)).



3 Component Architecture 

To explain the architecture and functioning of the new version of clasp, let us follow the
workflow underlying its design. To this end, consider clasp's architectural diagram given in
Figure 2. Although clasp also accepts other input formats, like (extended) dimacs, opb, and
wbo for describing Boolean satisfiability (SAT; (Biere et al. 2009)) and optimization
problems, we detail its functioning for computing answer sets of (propositional) logic programs,
as output by grounders like gringo (Gebser et al. 2011a) or lparse (Syrjanen). Similarly,
we concentrate on the multi-threaded setting, neglecting the single-threaded one.

At the start, only the main thread is active. Once the logic program is read in, it is subject
to several preprocessing stages, all conducted by the main thread. At first, the program is
(by default) simplified while identifying equivalences among its constituents (Gebser et al.
2008). The simplified program is then transformed into a compact representation in terms
of Boolean constraints (whose core is generated from the completion (Clark 1978) of the
simplified program). After that, the constraints are (optionally) subject to further, mostly
SAT-based preprocessing (Een and Biere 2005; Jarvisalo et al. 2010). Such techniques are
more involved in our ASP setting because variables relevant to unfounded-set checking,
optimization, or part of complex (i.e. cardinality and weight) constraints cannot be simply
eliminated. Note that both preprocessing steps identify redundant variables that can be
expressed in terms of the relevant ones included in the resulting set of constraints.

The outcomes of the preprocessing phase are stored in a SharedContext object that is 
initialized by the main thread and shared among all participating threads. Among others, 
this object contains 

• the set of relevant Boolean variables together with type information 
(e.g. atom, body, aggregate, etc.), 

• a symbol table, mapping (named) atoms from the program to internal variables, 

• the positive atom-body dependency graph, restricted to its strongly connected com- 
ponents, 

• the set of Boolean constraints, among them nogoods, cardinality and weight con- 
straints, minimize constraints, and 

• an implication graph capturing inferences from binary and ternary nogoods.

The richness of this information is typical for ASP, and it is much sparser in a SAT setting. 

After its initialization in association with a "master solver," further (solver) threads are
(concurrently) attached to the SharedContext, where its constraints are "cloned." Notably,
each constraint is aware of how to clone itself efficiently (cf. Section 5 on implementation
details). Moreover, the Enumerator and NogoodDistributor objects are used globally in
order to coordinate various model enumeration modes and nogood exchange among solver
instances. We detail their functioning in Section 4.

Each thread contains one Solver object, implementing the algorithm in Figure 1. Each
Solver stores 

• local data, including assignment, watch lists, constraint database, etc., 

• local strategies, regarding heuristics, restarts, constraint deletion, etc., 

and it uses the NogoodDistributor to share recorded nogoods. A solver assigns variables 
either by (deterministic) propagation or (non-deterministic) decisions. Motivated by the 
nature of ASP problems,³ each solver first propagates binary and ternary nogoods (shared
through the aforementioned implication graph), then longer nogoods and other constraints, 
before it finally applies any available post propagators. 

Post propagators constitute another important new feature of clasp 2, providing an
abstraction easing clasp's extensibility with more elaborate propagation mechanisms. For
this, each solver maintains a list of post propagators that are consecutively processed
after unit propagation. For instance, failed-literal detection and unfounded-set checking are
implemented in clasp 2 as post propagators. Similarly, they are used in the new version of
clasp's extension with constraint processing, clingcon (Gebser et al. 2009), to realize
theory propagation. Post propagators are assigned different priorities and are called in priority
order. Typically, we distinguish three priority classes:

• single post propagators are deterministic and only extend the current decision level. 
Unfounded-set checking is a typical example. 

• multi post propagators are deterministic and may add or remove decision levels.
Failed-literal detection is a typical example.

• complex post propagators may or may not be deterministic.
Nogood exchange is an example of this (see below).

3 ASP problems usually yield a large majority of binary nogoods due to program completion (Clark 1978). Also
note that unary nogoods capture initial problem simplifications that need not be rechecked during search.

Moreover, parallelism is also handled by means of post propagators, as described next. 
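
The priority-ordered processing of post propagators can be pictured with the following Python sketch. The class names, the numeric priority values, and the convention that a propagator returns False on conflict are assumptions made for illustration; they are not taken from clasp's C++ code.

```python
class PostPropagator:
    def __init__(self, name, priority, run):
        self.name = name
        self.priority = priority  # assumption: smaller value = called earlier
        self.run = run            # callable(solver) -> False on conflict


class Solver:
    def __init__(self, unit_propagate):
        self.unit_propagate = unit_propagate  # callable(solver) -> False on conflict
        self.post = []    # post propagators, kept sorted by priority
        self.trace = []   # names of the post propagators called, for inspection

    def add_post(self, propagator):
        self.post.append(propagator)
        # list.sort is stable, so equal priorities keep their insertion order.
        self.post.sort(key=lambda p: p.priority)

    def propagate(self):
        # Unit propagation comes first, post propagators afterwards.
        if not self.unit_propagate(self):
            return False
        for p in self.post:
            self.trace.append(p.name)
            if not p.run(self):
                return False  # stop at the first conflict
        return True
```

In this picture, unfounded-set checking would be registered with a small priority value, failed-literal detection with a larger one, and the integration of shared nogoods with the largest, so that it runs last.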

ParallelSolve controls concurrent solving with up to 64 individually configurable 
threads. When attaching a solver to the SharedContext, ParallelSolve associates a thread 
with the solver and adds dedicated post propagators to it. One high-priority post
propagator is added for message handling, and another, very low-priority post propagator is
supplied for integrating information stemming from models⁴ and/or shared nogoods.

For controlling parallel search, ParallelSolve maintains a set of atomic message flags: 

• terminate signals the end of a computation, 

• interrupt forces outside termination (e.g. when the user hits Ctrl+C), 

• sync indicates that all threads shall synchronize, and 

• split is set during splitting-based search whenever at least one thread needs work. 
These flags are used to implement clasp's two major search strategies: 

• splitting-based search via distribution of guiding paths and dynamic load balancing 
via a split-request and -response protocol, and 

• competition-based search via freely configurable solver portfolios. 

Notably, solver portfolios can also be used in splitting-based search, that is, different guid- 
ing paths may be solved with different configurations. 

4 Communication Architecture 

A salient transverse aspect of the architecture of clasp 2 is its communication infrastruc- 
ture, used for implementing advanced reasoning procedures. To begin with, the Parallel- 
Solve object keeps track of threads' load, particularly in splitting-based search. Moreover, 
the Enumerator controls enumeration-based reasoning modes, while the NogoodDistributor 
handles the exchange of recorded nogoods among solver threads. These communication-intense
components along with fundamental implementation techniques are detailed below
in increasing order of complexity.

4.1 Thread Coordination 

The basic communication architecture of clasp relies on message passing, efficiently im- 
plemented by lock-free atomic integers. On the one hand, globally shared atomic counters 
are stored in ParallelSolve. For instance, all aforementioned control flags are stored in a 
single shared atomic integer. On the other hand, each thread has a local message counter 
hosted by the message handling post propagator (see above). Message passing builds upon
two basic methods: postMessage() and hasMessage(). Posting a message amounts

4 This can regard an enumerated model to exclude, intersect, or union, as well as objective function values.

to a Compare-And-Swap⁵ (CAS) on an atomic integer, and checking for messages (via
specialized post propagators) is equivalent to an atomic read. Of particular interest is
communication during splitting-based search. This is accomplished via a lock-free work queue,
an atomic work request counter, and a work semaphore in ParallelSolve. Initially, the work 
queue only contains the empty guiding path, and all threads "race" for this work package 
by issuing a work request. A work request first tries to pop a guiding path from the work 
queue and returns upon success. Otherwise, the work request counter is incremented and a 
split request is posted, which results in raising the split flag. Afterwards, a wait() is tried
on the work semaphore.⁶ If wait() fails because the number of idle threads now equals
the total number of threads, the requesting thread posts a terminate message and wakes 
up all waiting threads. Otherwise, the thread is blocked until new work arrives. On the re- 
ceiver side, the message handling post propagator of each thread checks whether the split 
flag has been set. If so, and provided that the thread at hand has work to split, its message 
handler proceeds as follows. At first, it decrements the work request counter. (Note that 
the message handler thus declares the request as handled before actually serving it in order 
to minimize over-splitting.) If the work request counter reached 0, the message handler 
also resets the split flag. Afterwards, the search space is split and a (short) guiding path is 
pushed to the work queue in ParallelSolve. At last, the message handler signals the work 
semaphore and hence eventually wakes up a waiting thread. 
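
As a rough illustration of flag-based message passing on a single integer, consider the following Python sketch. Python offers no hardware CAS, so the class AtomicInt emulates one with a lock; this class, the bit values of the flags, and the function names post_message, has_message, and clear_message are assumptions made for the example and do not reproduce clasp's postMessage() and hasMessage().

```python
import threading

# Assumption: one bit per control flag, all packed into a single integer.
TERMINATE, INTERRUPT, SYNC, SPLIT = 1, 2, 4, 8


class AtomicInt:
    """Stand-in for a hardware atomic integer; a lock emulates atomicity."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the current value still equals `expected`."""
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


def post_message(flags, flag):
    """Raise `flag`; return True if this call raised it, False if already set."""
    while True:
        old = flags.load()
        if old & flag:
            return False
        if flags.compare_and_swap(old, old | flag):
            return True
        # Another thread changed the integer in between: retry on the new value.


def has_message(flags, flag):
    """Checking for a message is a plain (atomic) read."""
    return bool(flags.load() & flag)


def clear_message(flags, flag):
    while True:
        old = flags.load()
        if flags.compare_and_swap(old, old & ~flag):
            return
```

If several threads post the same flag concurrently, the retry loop ensures that exactly one call reports having raised it, which is the property a split request relies on.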

Splitting-based search usually suffers from uninformed early splits of the search space. 
To counterbalance this, ParallelSolve supports an advanced global restart scheme based 
on a two-phase strategy. In the first phase, threads vote upon effectuating a global restart 
based on some given criterion (currently, number of conflicts); however, individual threads 
may veto a global restart. For instance, this may happen in enumeration when a first model 
is found during this first restarting phase. Once there are enough votes, a global restart is 
initiated in the second phase. For this, a sync message is posted and threads wait until all 
solvers have reacted to this message. The last reacting thread decides on how to continue. 
If no veto was issued, the global restart is executed. That is, threads give up their guiding 
paths, the work queue is cleared, and the initial (empty) guiding path is again added to 
the work queue. Otherwise, the restart is abandoned, and the threads simply continue with 
their current guiding paths. 
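
The two-phase restart scheme can be summarized by a small state machine. The Python sketch below models the bookkeeping only and ignores concurrency; the class GlobalRestart, the parameter votes_needed, and the representation of guiding paths as a dictionary from thread ids to tuples are hypothetical choices for this example.

```python
from collections import deque


class GlobalRestart:
    def __init__(self, votes_needed):
        self.votes_needed = votes_needed  # hypothetical vote threshold
        self.votes = 0
        self.vetoed = False

    def vote(self):
        """Phase 1: a thread votes; True means a sync message should be posted."""
        self.votes += 1
        return self.votes >= self.votes_needed

    def veto(self):
        """Phase 1: a thread objects, e.g. after finding a first model."""
        self.vetoed = True

    def decide(self, work_queue, guiding_paths):
        """Phase 2: called by the last thread that reacted to the sync message."""
        restart = not self.vetoed
        if restart:
            guiding_paths.clear()   # threads give up their guiding paths
            work_queue.clear()      # the work queue is cleared
            work_queue.append(())   # and the empty guiding path is re-added
        # Otherwise the restart is abandoned and guiding paths stay untouched.
        self.votes, self.vetoed = 0, False
        return restart
```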

If splitting-based search is not active (i.e. during competition-based search), the work 
queue initially contains one (empty) guiding path for each thread, and additional work 
requests simply result in the posting of a terminate message. 

4.2 Nogood Exchange 

Given that each thread implements conflict-driven search involving nogood learning, the
corresponding solvers may benefit from a controlled exchange of their recorded information.
However, such an interchange must be handled with great care because each

5 Conditional writing is performed as an atomic CPU instruction to achieve synchronization in multi-threading.

6 See http://en.wikipedia.org/wiki/Semaphore_(programming) in case of unfamiliarity
with the working of semaphores.

individual solver may already learn exponentially many nogoods, so that their additional sharing
may significantly hamper the overall performance.

To differentiate which nogoods to share, clasp 2 pursues a hybrid approach regarding 
both nogood exchange and storage. As described in Section 3, the binary and ternary
implication graph (as well as the positive atom-body dependency graph) are shared among
all solver threads. Otherwise, each solver maintains its own local nogood database. The 
sharing of these nogoods is optional, as we detail next. 

The actual exchange of nogoods is controlled in clasp by separate distribution and
integration components for carefully selecting the spread constraints. This is supported by
thread-local interfaces along with the global NogoodDistributor (see Figure 2). All
components rely on interfaces abstracting from the specific sharing mechanism used underneath.

The distribution of nogoods is configurable in two ways. First, the exported nogoods 
can be filtered by their type, viz. conflict, loop, or short (i.e. binary and ternary), or be 
exhaustive or inhibited. The difference between globally sharing short nogoods (via their 
implication graph) and additionally "distributing" them lies in the proactiveness of the pro- 
cess. While the mere sharing leaves it to each solver to discover nogoods added by others, 
their explicit distribution furthermore communicates this information through the standard 
distribution process. Second, the export of nogoods is subject to their respective number of
distinct decision levels associated with the contained literals, called the Literal Block
Distance (LBD; (Audemard and Simon 2009)). Fewer distinct decision levels are regarded as
advantageous since they are prone to prune larger parts of the search space. This criterion
has empirically been shown to be rather effective and largely superior to a selection by length.
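
The LBD of a nogood follows directly from its definition as the number of distinct decision levels among its literals. In the following Python sketch, the mapping level_of from variables to decision levels and the threshold max_lbd are assumptions introduced for the example; the actual option names and defaults of clasp are not shown here.

```python
def lbd(nogood, level_of):
    """Number of distinct decision levels among the literals of `nogood`.

    `nogood` is a tuple of non-zero integers (literals), and `level_of`
    maps each variable (the absolute value of a literal) to its decision level.
    """
    return len({level_of[abs(lit)] for lit in nogood})


def should_distribute(nogood, level_of, max_lbd):
    """Export filter: share only nogoods whose LBD does not exceed `max_lbd`."""
    return lbd(nogood, level_of) <= max_lbd
```

For example, if variables 1 and 2 were assigned at level 1, variable 3 at level 4, and variable 4 at level 7, the nogood (1, -2, 3, -4) has four literals but an LBD of 3.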

The integration of nogoods is likewise configurable in two ways. The first criterion cap- 
tures the relevance of a nogood to the local search process. First, the state of a nogood 
is assessed by checking whether it is satisfied, violated, open (i.e. neither satisfied nor 
violated), or unit w.r.t. the current (partial) assignment. While violated and unit nogoods 
are always considered relevant, open nogoods are optionally passed through a filter using 
the solver's current heuristic values to discriminate the relevance of the candidate nogood 
to the current solving process. Finally, satisfied nogoods are either ignored or considered 
open depending on the configuration of the corresponding filter and their state relative to 
the original guiding path. The second integration criterion is expressed by a grace period 
influencing the size of the local import queue and thereby the minimum time a nogood is 
stored. Once the local import queue is full, the least recently added nogood is evicted and 
either transferred to the thread's nogood database (where it becomes subject to the thread's 
nogood deletion policy) or immediately discarded. Currently, two modes are distinguished. 
The thread transfers either all or only "heuristically active" nogoods from its import queue 
while discarding all others. 
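
Both integration criteria can be sketched in a few lines of Python. The first function classifies a nogood with respect to a partial assignment (a set of true literals, with -l the complement of l); the class ImportQueue models the grace period as a bounded queue. All names, including the predicate is_active standing in for the "heuristically active" test, are hypothetical and only serve this illustration.

```python
from collections import deque


def state(nogood, assignment):
    """Classify `nogood` as 'satisfied', 'violated', 'unit', or 'open'."""
    if any(-lit in assignment for lit in nogood):
        return "satisfied"   # some literal is false: it cannot be violated
    free = [lit for lit in nogood if lit not in assignment]
    if not free:
        return "violated"    # all literals are true
    if len(free) == 1:
        return "unit"        # the complement of the last literal is forced
    return "open"


class ImportQueue:
    def __init__(self, grace, is_active=None):
        self.grace = grace          # queue capacity, assumed to be at least 1
        self.is_active = is_active  # None: transfer every evicted nogood
        self.queue = deque()
        self.database = []          # stands in for the thread's nogood database

    def add(self, nogood):
        if len(self.queue) == self.grace:
            oldest = self.queue.popleft()  # evict the least recently added
            if self.is_active is None or self.is_active(oldest):
                self.database.append(oldest)
            # Otherwise the evicted nogood is discarded.
        self.queue.append(nogood)
```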

Both distribution and integration are implemented as dedicated (complex) post propagators,
based upon a global distribution scheme implemented via an efficient lock-free
Multi-Read-Multi-Write (MRMW) list situated in ParallelSolve.⁷ Distribution roughly works as
follows. When the solver of Thread i records a nogood that is a candidate for sharing, it is
first integrated into the thread-local nogood database. In addition, the nogood's reference

7 This choice is motivated by the fact that we aim at optimizing clasp for desktop computers, still mostly
possessing few genuine processing units. Other strategies are possible and an active subject of current research.

counter is set to the total number of threads plus one, and its target mask to all threads
except i. At last, Thread i appends the shared nogood to the aforementioned MRMW list. 

Conversely upon integration, Thread j traverses the MRMW list, thereby ignoring all 
nogoods whose target mask excludes j. Depending on the state of a nogood, the afore- 
mentioned filters decide whether a nogood is relevant or not. All relevant nogoods are 
integrated into the search process of Thread j and added to its local import queue. The 
reference counter of each nogood is decremented by each thread moving its read pointer 
beyond it. In addition, the sharing thread i decrements a nogood's reference counter when- 
ever it no longer uses it. Hence, the reference counter of a shared nogood can only drop to 
zero once it is no longer addressed by any read pointer. This makes it subject to deletion. 
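
The interplay of reference counter and target mask can be traced with the following Python sketch. It uses plain integers and a Python list in place of atomic counters and the lock-free MRMW list, so it only models the counting logic; the names SharedNogood, release, and integrate are assumptions of this example.

```python
class SharedNogood:
    def __init__(self, literals, sender, num_threads):
        self.literals = tuple(literals)
        # One reference per thread's read pointer plus one for the sender's own use.
        self.refs = num_threads + 1
        # Bit j is set if and only if thread j is a target (all threads but the sender).
        self.targets = ((1 << num_threads) - 1) & ~(1 << sender)

    def release(self):
        """Drop one reference; True means the nogood may now be deleted."""
        self.refs -= 1
        return self.refs == 0


def integrate(shared, read_pos, thread, is_relevant):
    """Advance the read pointer of `thread` over `shared` (a list of SharedNogood).

    Returns the literals of all imported nogoods and the new read position.
    """
    imported = []
    while read_pos < len(shared):
        nogood = shared[read_pos]
        if (nogood.targets >> thread) & 1 and is_relevant(nogood.literals):
            imported.append(nogood.literals)
        nogood.release()  # the read pointer moves beyond this nogood
        read_pos += 1
    return imported, read_pos
```

With three threads and sender 0, a nogood starts with four references. After all three read pointers have passed it, one reference remains, and it is dropped once the sender no longer uses the nogood.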

Notably, the shared representation of a nogood is only created when the nogood is ac- 
tually distributed. Otherwise, its optimized (single-threaded) representation is used. Upon 
integration, the "best" representation is selected, for instance, short nogoods are copied 
while longer ones are physically shared (see Section 5 for implementation details).

4.3 Complex Reasoning Modes 

In addition to model printing, all enumeration-based reasoning modes of clasp 2 are
controlled by the global Enumerator (see Figure 2). These reasoning modes include regular
and projected model enumeration, intersection and union of models, uniform and hierar- 
chical (multi-criteria) optimization as well as combinations thereof, like computing the 
intersection of all optimal models. 

As already mentioned, one global Enumerator is shared among all threads and is pro- 
tected by a lock. Whenever applicable, it hosts global constraints, like minimize con- 
straints, that are updated whenever a model is found. Additionally, the Enumerator adds 
a local enumeration-specific constraint to each solver for storing thread-local data, e.g. 
current optima (see below). Once a model is found, a dedicated message update-model is
sent to all threads, but threads only react to the most recent one.

In fact, enumeration is combinable with both search strategies described in Section 3,
either by applying dedicated enumeration algorithms taking advantage of guiding paths or 
by using solution recording in a competitive setting. The latter setting exploits the infras- 
tructure for nogood exchange in order to distribute solutions among solver threads. Once 
a solution is converted into a nogood, it can be treated as usual, except that its integra- 
tion is imperative and that it is exempt from deletion. However, this approach suffers from 
exponential space complexity in the worst case. Unlike this, splitting-based enumeration 
runs in polynomial space, following a distributed version of the enumeration algorithm 
introduced in (Gebser et al. 2007a). In order to avoid uninformed splits at the beginning, 
all solver threads may optionally start in a competitive setting. Once the first model is 
found, the Enumerator enforces splitting-based search among all solver threads and
disables global restarts. In addition to the distribution of disjoint guiding paths, backtrack
levels (see (Gebser et al. 2007a)) are dealt with locally in order to guarantee an exhaustive
and duplicate-free enumeration of all models. 

In optimization, solver threads cooperate in enumerating one better model after another 
until no better one is found, so that the last model is optimal. Whenever a better model is 
found, its objective value is stored in the Enumerator. The threads react upon the following 
update-model message by integrating the new value into their local minimize constraint
representation⁸ and thus into the search processes of their solvers. Minimize constraints
provide methods for efficiently re-computing their state after an update, so that restarting
search is unnecessary in most cases. An innovative feature of clasp 2 is hierarchical
optimization (Gebser et al. 2011b), built on top of uniform optimization. Hierarchical
optimization allows for solving multi-criteria optimization problems by considering crite- 
ria according to their respective priorities. Such an approach is much more involved than 
standard branch-and-bound-based optimization because it must recover from several unsat- 
isfiable subproblems, one for each criterion. This is accomplished by dynamic minimize 
constraints that may be disabled and reinitialized during search. Accordingly, nogoods 
learned under minimize constraints must be retracted once the constraint gets disabled. 
Another benefit of such dynamic constraints is that we may decrease the (upper) bound in 
a non-uniform way, and successively re-increase it upon unsatisfiability. Hierarchical opti- 
mization allows for gaining an order of magnitude on multi-criteria problems, as witnessed 
in Linux configuration (Gebser et al. 2011c).
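
The ordering underlying multi-criteria optimization can be made concrete with a short Python sketch. It assumes that the objective values of a model are given as a tuple ordered from the highest to the lowest priority, so that Python's tuple comparison yields the lexicographic order; iterating over candidate models merely stands in for search, and the dynamic minimize constraints described above are not modeled. The names better and optimize are hypothetical.

```python
def better(costs, bound):
    """True if `costs` improves on `bound` (None means no model found so far)."""
    return bound is None or tuple(costs) < tuple(bound)


def optimize(models):
    """Return the best model and its costs among `models`.

    `models` is an iterable of (model, costs) pairs, where `costs` lists the
    objective values from the highest to the lowest priority.
    """
    bound, best = None, None
    for model, costs in models:
        if better(costs, bound):
            # A better model tightens the bound used for all later candidates.
            bound, best = tuple(costs), model
    return best, bound
```

For instance, costs (1, 9) are better than (2, 0) because the criterion of higher priority decides, while (1, 3) improves on (1, 9) through the second criterion.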

Also, brave and cautious reasoning, computing the union and intersection of all models, 
respectively, are implemented through a global constraint within the Enumerator. When- 
ever a new model is found, the constraint is intersected with the model (or its complement). 
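
A minimal Python sketch of this scheme is given below. It assumes that models are sets of atoms over a known set atoms and that they arrive one at a time; the function name consequences is hypothetical. Cautious consequences are obtained by intersecting with each model, and brave consequences by intersecting with each model's complement and complementing the result at the end.

```python
def consequences(atoms, models):
    """Return (cautious, brave) consequences of `models` over `atoms`.

    `models` is an iterable of sets of atoms; with no models at all, every
    atom is (vacuously) a cautious consequence and none is a brave one.
    """
    atoms = frozenset(atoms)
    cautious = set(atoms)      # atoms true in every model seen so far
    never_true = set(atoms)    # atoms false in every model seen so far
    for model in models:
        cautious &= model              # intersect with the model
        never_true &= atoms - model    # intersect with its complement
    return cautious, atoms - never_true
```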



5 Implementation 

A major design goal of clasp 2 was to leverage the power of today's multi-core shared 
memory machines, while keeping the resulting overhead low so that the single-threaded 
variant does not suffer from a significant loss in performance. In particular, we aimed at 
empowering physical sharing of constraints and data while avoiding false sharing, locking, 
and communication overhead. To this end, our design foresees a clear distinction between 
three types of data representations, viz. 

• read-only data providing lock- and wait-free sharing (without deadlocks and races), 

• shared data being subject to concurrent updates via CAS or locks (admitting races), 
and 

• thread-local data being private to each thread and thus not sharable (avoiding dead- 
locks and races). 
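
The three kinds of data can be mimicked with Python's standard threading module, as in the minimal sketch below. It only illustrates the distinction: clasp itself is not written in Python, Python offers no CAS primitive (a lock stands in for it here), and all names are hypothetical.

```python
import threading

# Read-only data: immutable tuples can be read by all threads without locking.
PROBLEM_NOGOODS = ((1, -2), (2, 3, -4))


class SharedCounter:
    """Shared data: concurrent updates are serialized by a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.value = 0

    def add(self, n: int) -> None:
        with self._lock:
            self.value += n


# Thread-local data: each thread sees its own attributes on this object.
local = threading.local()


def worker(counter: SharedCounter, results: list, idx: int) -> None:
    local.seen = 0
    for nogood in PROBLEM_NOGOODS:   # lock-free read of read-only data
        local.seen += len(nogood)    # private update, no synchronization
    counter.add(local.seen)          # synchronized update of shared data
    results[idx] = local.seen


def run(num_threads: int = 4):
    counter = SharedCounter()
    results = [0] * num_threads
    threads = [threading.Thread(target=worker, args=(counter, results, i))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value, results


print(run(4))  # (20, [5, 5, 5, 5])
```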

Let us make this more precise by detailing the data representations of the various types 
of constraints used in clasp. Constraints are typically separated into a thread-local and a 
(possibly shared) read-only part. While the former usually contains search-specific and 
thus dynamic data, the latter typically comprises static data not being subject to change. 

As mentioned above, the implication graph is shared among all threads and stores 
inferences from binary and ternary nogoods. The corresponding data structure is separated 
into two parts. On the one hand, a static read-only part is initialized during preprocessing; it
stores two vectors, bin(l) and tern(l), for each literal l. The former contains literals
being forced once l becomes true. Similarly, the latter stores binary clauses being activated
when l becomes true. For better data locality, bin(l) and tern(l) are actually stored
in one memory block. On the other hand, the dynamic part supports concurrent updates for
storing and distributing short recorded nogoods. To this end, it includes, for each literal l,
an atomic pointer, learnt(l), to a linked list of CACHE_LINE_SIZE-sized memory
blocks. Each such memory block contains a fixed-size array of binary and ternary nogoods.
This setting guarantees that propagation over learnt(l) is efficient and does not need
any locks (given that short clauses are never removed). Moreover, we rely on fine-grained
spinlocks to enable efficient updates of fixed-size arrays.

8 While the literals of a minimize constraint are stored globally, corresponding upper bounds are local to threads,
and changes are communicated through the Enumerator.
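
To make the role of bin(l) and tern(l) concrete, here is a small single-threaded Python sketch of the static part only. Literals are encoded as non-zero integers with -x denoting the complement of x, and clauses (rather than nogoods) are used for readability; these encodings, the class name, and the omission of the dynamic learnt(l) blocks are all simplifying assumptions.

```python
from collections import defaultdict
from typing import DefaultDict, List, Optional, Set, Tuple


class ImplicationGraph:
    """Static part of a binary/ternary implication graph (illustrative only)."""

    def __init__(self) -> None:
        # bin[l]: literals forced once l becomes true.
        self.bin: DefaultDict[int, List[int]] = defaultdict(list)
        # tern[l]: binary clauses (a, b) activated once l becomes true.
        self.tern: DefaultDict[int, List[Tuple[int, int]]] = defaultdict(list)

    def add_clause(self, clause: Tuple[int, ...]) -> None:
        if len(clause) == 2:
            a, b = clause
            self.bin[-a].append(b)
            self.bin[-b].append(a)
        elif len(clause) == 3:
            a, b, c = clause
            self.tern[-a].append((b, c))
            self.tern[-b].append((a, c))
            self.tern[-c].append((a, b))
        else:
            raise ValueError("only binary and ternary clauses are stored")

    def propagate(self, literal: int, assignment: Set[int]) -> Optional[Set[int]]:
        """Make `literal` true on top of `assignment`; return None on conflict."""
        true = set(assignment)
        queue = [literal]
        while queue:
            lit = queue.pop()
            if -lit in true:
                return None              # complement already true: conflict
            if lit in true:
                continue
            true.add(lit)
            queue.extend(self.bin[lit])  # direct implications
            for a, b in self.tern[lit]:  # activated binary clauses (a or b)
                a_false, b_false = -a in true, -b in true
                if a_false and b_false:
                    return None
                if a_false:
                    queue.append(b)
                elif b_false:
                    queue.append(a)
        return true


graph = ImplicationGraph()
graph.add_clause((-1, 2))        # 1 implies 2
graph.add_clause((-2, -3, 4))    # 2 and 3 together imply 4
print(sorted(graph.propagate(1, {3})))  # [1, 2, 3, 4]
```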

In analogy, longer nogoods are separated into two parts, called head and tail. The head 
part is always thread-local and is referenced in the owning thread's watch lists. It stores 
two watched literals, one cache literal, and some extra dynamic data, like nogood activity.
The cache literal provides a (potential) spare watched literal, in case one of the two
original ones is assigned. That is, upon updating the watched literals, the cache literal is
inspected before a costly visit of the literals in the (possibly shared) tail part is engaged.^9
Further contents of the head part depend on whether a nogood is shared. If not, the nogood 
stores its unshared tail part, including the nogood's size and remaining literals, together 
with the head in one continuous memory block. Otherwise, the head points to a read-only 
shared tail object containing the nogood's literals, an (atomic) reference counter, and fur- 
ther static data, like the size of the nogood. The separation into a dynamic thread-local 
and a static read-only shared part is motivated by the fact that sharing only needs to repli- 
cate the search-specific state of a nogood, like its watched literals and activity. Notably, 
although a more local representation of shared nogoods would be possible, it is important 
to avoid storing dynamic data of different threads in the same coherence block (e.g. a cache 
line); otherwise, writes of one thread lead to (logically) unnecessary coherence operations 
in other threads. Our separation of data ensures that thread-local data of different threads 
is never stored together and thus avoids such "false sharing." Regarding representation, 
clasp employs the following policies. Short nogoods of up to five literals are never physi- 
cally shared, but completely stored in thread-local head parts for improving access locality. 
Original problem nogoods are physically shared in the presence of multiple threads, except 
if copying (instead of sharing) of problem nogoods is enforced. Finally, recorded nogoods
are only shared on demand, as described in Section 4.
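
The head/tail separation and the role of the cache literal can be sketched as follows. This Python model is a simplification under stated assumptions: the sharing decision looks only at nogood size and thread count (ignoring the problem/recorded distinction and the copy option), the reference counter is left out, and all class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

MAX_UNSHARED_SIZE = 5  # nogoods of up to five literals stay thread-local


@dataclass(frozen=True)
class SharedTail:
    """Read-only part; a real implementation would add an atomic reference counter."""
    literals: Tuple[int, ...]


@dataclass
class Head:
    """Thread-local part, one per thread that uses the nogood."""
    watches: List[int]
    cache: int
    literals: Sequence[int]  # private copy, or a reference to the shared tail
    shares_tail: bool
    activity: float = 0.0


def make_head(tail: SharedTail, num_threads: int) -> Head:
    lits = tail.literals
    if len(lits) < 3:
        raise ValueError("this sketch expects at least three literals")
    share = num_threads > 1 and len(lits) > MAX_UNSHARED_SIZE
    stored: Sequence[int] = lits if share else list(lits)
    return Head(watches=[lits[0], lits[1]], cache=lits[2],
                literals=stored, shares_tail=share)


def replace_watch(head: Head, idx: int, is_assigned: Callable[[int], bool]) -> bool:
    """Replace the assigned watch at position idx; return False if impossible."""
    # Cheap path: the cache literal lives in the thread-local head.
    if not is_assigned(head.cache):
        head.watches[idx], head.cache = head.cache, head.watches[idx]
        return True
    # Costly path: visit the (possibly shared) tail literals.
    for lit in head.literals:
        if lit not in head.watches and not is_assigned(lit):
            head.watches[idx] = lit
            return True
    return False


tail = SharedTail((1, 2, 3, 4, 5, 6, 7))
head = make_head(tail, num_threads=4)
assigned = {1}
replace_watch(head, 0, assigned.__contains__)
print(head.watches, head.cache)  # [3, 2] 1
```

Only the head is written during search, so several threads can hold heads that point to the same tail without touching each other's data.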

Analogously to nogoods, weight constraints have a thread-local part storing current 
assignments (to enclosed literals) and the corresponding sum of weights as well as a shared 
part storing size, literals, weights, and a reference counter. The shared part of a minimize 
constraint (cf. Section 4) in addition includes priority levels of literals, and thread-local
parts contain current (upper) bounds. 

Finally, unfounded-set checking also relies on a bipartite data representation. As 
mentioned above, it is implemented as a dedicated post propagator utilizing the (read-only)
shared strongly connected components of a program's positive atom-body dependency
graph (cf. Section 3). This is again counterbalanced by a thread-local part storing
assignment-specific data, like source pointers (cf. (Simons et al. 2002)).



9 The Watched Literal Reference Lists of miraxt (Schubert et al. 2009) follow a similar approach.
Fig. 3. Number of solved instances per time for clasp 2 and other multi-threaded SAT
solvers. 



6 Experiments 

We conducted two series of experiments, the first comparing clasp 2 to other multi-threaded 
CDCL-based (SAT) solvers and the second assessing the impact of different parallel search 
features. In fact, efforts to parallelize CDCL have so far concentrated on the area of SAT, 
and thus we compare clasp (version 2.0.5) to the following multi-threaded SAT solvers: 
cryptominisat (version 2.9.2; (Soos et al. 2009)), manysat (version 1.1; (Hamadi et al.
2009b)), miraxt (version 2009; (Schubert et al. 2009)), and plingeling (version 587f; (Biere
2011)). While miraxt performs search space splitting via guiding paths, the three other
solvers let different configurations of an underlying sequential SAT solver compete with
one another. Furthermore, nogood exchange among individual threads is either confined to
short nogoods, only unary (plingeling) or binary ones as well (cryptominisat), performed
adaptively (manysat; cf. (Hamadi et al. 2009a)), or exhaustive in view of a shared nogood
database (miraxt). The solvers were run on a Linux machine with two Intel Quad-Core
Xeon E5520 2.27GHz processors, imposing a limit of 1000 (or 1200) seconds wall-clock
time per solver and benchmark instance in the first (or second) series of experiments.^10

Our first series of experiments evaluates the performance of clasp in comparison to 
other multi-threaded SAT solvers. To this end, we ran the aforementioned solvers on 160 
benchmark instances from the Crafted category at the 2011 SAT competition.^11 The plot
in Figure 3 displays numbers of solved instances (on the y-axis) as a function of time (in
log scale on the x-axis). As (sequential) baseline, we include clasp running one thread 
in the configuration submitted to the 2011 SAT competition. This configuration is con- 
trasted with four- and eight-threaded variants of the considered parallel SAT solvers, us- 
ing a prefabricated portfolio (clasp --create-template) for competing threads of 
clasp. First of all, we observe in Figure 3 that all multi-threaded solvers complete more
instances than sequential clasp when given sufficient time (more than 10 seconds). This 
is unsurprising because the available CPU time roughly amounts to the product of wall- 



10 The benchmark suites are available at http://www.cs.uni-potsdam.de/clasp

11 From the whole collection of 300 competition benchmarks, the 160 selected instances could be solved with
ppfolio (Roussel 2011), the (wall-clock time) winner in the Crafted category at the 2011 SAT competition,
within 1000 seconds. Without this preselection, plenty (more) runs of the considered solvers would not finish 
in the time limit, and running the experiments would have consumed an order of magnitude more time. 
Fig. 4. Number of solved instances per time for different parallel search strategies of
clasp 2. 

clock time and number of threads, given that our benchmark machine offers sufficient 
computing resources for concurrent thread execution. In fact, we further observe that each 
multi-threaded solver benefits from running more (eight instead of four) threads. How- 
ever, the increase in the number of solved instances is solver-specific and rather small with 
manysat, which mainly duplicates its fixed portfolio of four configurations in the transition 
to eight threads (changing only the random seed used in the branching heuristics). Unlike 
this, the other multi-threaded solvers complete between five (clasp) and eight (cryptomin- 
isat, miraxt, and plingeling) more instances in the time limit when doubling the number 
of threads. These improvements are significant because harnessing additional computing 
resources for parallel search is justified when it makes instances accessible that are hard 
(or unpredictable) to solve sequentially.^12 Comparing the performance of multi-threaded
clasp to other SAT solvers shows that clasp is very competitive, thus emphasizing the
(low-level) efficiency of its parallel infrastructure. Note, however, that Crafted
benchmarks are closer to ASP problems, which clasp was originally designed for, than those
in SAT competitions' Application category, to which the other four SAT solvers are tailored.
Finally, although solver portfolios (as used in ppfolio) proved to be powerful at the
2011 SAT competition, we do not include them in our experiments because their diverse 
members are run in separation, thus not utilizing multi-threading for parallelization. 
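
For readers who want to reproduce plots of this kind from their own measurements, the following Python sketch computes the number of solved instances as a function of time. The runtimes in the example are invented placeholders, not data from the experiments above, and the function name is hypothetical; `None` marks a run that did not finish.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence


def solved_within(runtimes: Sequence[Optional[float]],
                  limits: Sequence[float],
                  timeout: float) -> List[int]:
    """Count, for each time limit, the instances solved within that limit."""
    finished = sorted(t for t in runtimes if t is not None and t <= timeout)
    return [bisect_right(finished, limit) for limit in limits]


# Placeholder wall-clock times in seconds (not measured values).
runtimes = [0.4, 12.0, None, 950.0, 73.5, 1000.1]
print(solved_within(runtimes, [1, 10, 100, 1000], timeout=1000))  # [1, 1, 3, 4]
```

Evaluating the limits at powers of ten matches the logarithmic time axis used in the figures.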

The second series of experiments assesses parallel search features of clasp on a broad 
collection of 1435 benchmark instances, stemming from the 2009 ASP and SAT competitions
as well as the 2006 and 2008 SAT races. To begin with, the plot in Figure 4
compares different parallel search strategies, viz. portfolio of competing threads (PORT), 
search space splitting via guiding paths (GP), splitting-based search with a portfolio of dif- 
ferent configurations (PORT+GP), and the previous setting augmented with global restarts 
(PORT+GP+GR). Note that the PORT mode matches the clasp setup that has already been 
used above, and that up to ten restarts (according to the geometric policy 500*1.5^i) are
performed globally with the PORT+GP+GR mode. As in our first experiments, we observe
that all multi-threaded clasp modes dominate the baseline of running a single thread.
Similarly, each mode benefits from more threads, where the transition from two to four 

12 The speedup (in terms of wall-clock time) of eight-threaded over single-threaded clasp is about 1.5, which may 
seem low, but the eight-threaded variant completes 31 instances (with unknown sequential solving time) more.
Fig. 5. Number of solved instances per time for different nogood exchange policies of
clasp 2. 



threads is particularly significant with portfolio approaches (e.g. 32 more instances com- 
pleted with PORT). In fact, the latter dominate the GP mode relying on a uniform clasp 
(default) configuration, especially when the number of threads is greater than two. This 
indicates the difficulty of making fair splits in view of irregular search spaces, while running
different configurations in parallel improves the chance of success (cf. (Hyvärinen
et al. 2011)). Although the robustness of splitting-based search is somewhat enhanced by
running different configurations (PORT+GP) and additionally applying global restarts to
refine uninformed splits (PORT+GP+GR), its combinations with guiding paths could not 
improve over the plain PORT mode. However, it would be interesting to scale this experi- 
ment further up (on a machine with more than eight cores) in order to investigate whether a 
portfolio becomes saturated at some point, so that combinations with search space splitting 
would be natural to exploit greater parallelism. 
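
The geometric policy used for global restarts in the PORT+GP+GR mode can be tabulated with a short Python sketch. It assumes the policy reads 500*1.5^i with i starting at 0 and with limits rounded down to integers; the unit of the limits is not spelled out here (conflicts are a common choice in CDCL solvers), and the function name is hypothetical.

```python
from typing import List


def geometric_restart_limits(base: int = 500, factor: float = 1.5,
                             count: int = 10) -> List[int]:
    """Return the first `count` limits base * factor**i, rounded down."""
    return [int(base * factor ** i) for i in range(count)]


limits = geometric_restart_limits()
print(limits[:4])  # [500, 750, 1125, 1687]
```

With ten restarts, the last limit is roughly 38 times the first, so later search episodes run much longer before a split is reconsidered.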

Finally, Figure 5 plots the performances of clasp (PORT mode) w.r.t. nogood exchange
policies. Given that the binary and ternary implication graph is always shared among
all threads, the difference between the NO and SHORT modes is that short nogoods are
recorded "silently" with NO and proactively communicated with SHORT (cf. Section 4.2).
The LBD-2 and -4 modes further extend SHORT by additionally distributing "long" nogoods
whose LBD does not exceed 2 or 4, respectively, independent of the nogood size in
terms of literals. While the number of solved instances is primarily influenced by the number
of threads, different nogood exchange policies are responsible for gradual differences
between clasp variants running the same number of threads. With four and eight threads,
the LBD modes are more successful than NO and SHORT, especially in the time interval
from 10 to a few hundred seconds. This shows that the exchange of information helps
to reduce redundancies between the search processes of individual threads; it further supports
the conjecture in (Audemard and Simon 2009) that "our measure [LBD] will also be
very useful in the context of parallel SAT solvers." Interestingly, even when running eight
threads, the performances of LBD-2 and -4 modes are close to each other, with a slight
tendency towards LBD-4. Our experiments thus do not exhibit bottlenecks due to the additional
exchange of nogoods with LBD 3 and 4. However, more exhaustive experiments
are required (and part of our ongoing work) to find a good trade-off between number of
threads and LBD limit for exchange. Ultimately, dynamic measures like those suggested
in (Hamadi et al. 2009a) are indispensable for self-adapting nogood exchange to different
problem characteristics, and adding such measures to clasp is a subject of future work.
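
The exchange policies compared above can be summarized by a small filter. The Python sketch below computes the LBD of a nogood as the number of distinct decision levels among its literals (Audemard and Simon 2009) and decides whether a nogood would be distributed. Treating "short" as at most three literals, the dictionary-based level lookup, and the function names are assumptions made for this illustration.

```python
from typing import Dict, Sequence

SHORT_SIZE = 3  # assumption: "short" covers nogoods of up to three literals


def lbd(nogood: Sequence[int], level_of: Dict[int, int]) -> int:
    """Number of distinct decision levels among the literals of a nogood."""
    return len({level_of[abs(lit)] for lit in nogood})


def should_distribute(nogood: Sequence[int], level_of: Dict[int, int],
                      policy: str) -> bool:
    """Decide whether a recorded nogood is communicated to other threads."""
    if policy == "NO":
        return False                      # recorded "silently"
    if len(nogood) <= SHORT_SIZE:
        return True                       # SHORT and both LBD modes
    if policy == "SHORT":
        return False
    limit = {"LBD-2": 2, "LBD-4": 4}[policy]
    return lbd(nogood, level_of) <= limit


levels = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3}
long_nogood = (1, -2, 3, -4, 5)
print(lbd(long_nogood, levels))                           # 3
print(should_distribute(long_nogood, levels, "LBD-2"))    # False
print(should_distribute(long_nogood, levels, "LBD-4"))    # True
```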



7 Related Work 

Parallel ASP solving has so far been dominated by approaches distributing tree search by extending
the solver smodels in various ways (Finkel et al. 2001; Hirsimäki 2001; Pontelli
et al. 2003; Balduccini et al. 2005; Gressmann et al. 2005; Gressmann et al. 2006). While
smodels applies systematic backtracking-based search, following the scheme of DPLL 
used in traditional SAT solving, clasp as well as modern SAT solvers are based on CDCL, 
relying on conflict-driven learning and backjumping. However, the clear edge of CDCL- 
based solvers over DPLL-based ones also brings about more sophisticated search proce- 
dures that have to be accommodated in a distributed setting. Apart from distributed con- 
straint learning, this particularly affects the coordination of model enumeration. 



The approach taken with claspar (Ellguth et al. 2009; Gebser et al. 2011d) can be regarded
as a precursor to our present work. claspar is designed for a cluster-oriented setting
without any shared memory. It thus aims at large-scale computing environments, where
physical distribution necessitates data copying rather than sharing. In fact, claspar can be
understood as a wrapper controlling the distribution of independent clasp instances via
MPI (Gropp et al. 1999), thereby taking advantage of clasp's interfaces for data exchange.
However, compared to claspar, (quasi) instantaneous communication via shared memory 
enables a much closer collaboration (e.g. rapid nogood exchange) among threads in clasp. 

Although much work has also been carried out in the area of parallel logic programming, 
among which or-parallelism (Gupta et al. 2001; Chassin de Kergommeaux and Codognet
1994) is similar to search space splitting, our work is more closely related to parallel SAT
solving, tracing back to (Zhang et al. 1996; Blochinger et al. 2003). Among modern approaches
to multi-threaded SAT solving, the ones of miraxt (Schubert et al. 2009) and
manysat (Hamadi et al. 2009b) are of particular interest due to their complementary treatment
of recorded nogoods. miraxt is implemented via pthreads and uses a globally shared
nogood database. The advantage of this is that each thread sees all nogoods and can inte- 
grate them with low latency. However, given that multiple threads read and write on the 
database, it needs readers-writer locks. Moreover, many nogoods are actually never used 
by more than one thread, but still produce some maintenance overhead in each thread. 
manysat is implemented via openmp and uses a copying approach to nogood exchange, 
proscribing any physical sharing. That is, each among n solver threads has its own nogood 
database, and nogood exchange is accomplished by copying via n*(n-1) pairwise distribution
queues. While this approach performs well for a small number n of solver threads,
it does not scale up due to the quadratic number of queues and excessive copying. Recent 
parallel SAT solvers further include plingeling (Biere 2011) and the multi-threaded variant
of cryptominisat (Soos et al. 2009). Finally, note that, while knowledge exchange and
(shared) memory access matter likewise in parallel SAT and ASP solving, the scope of the 
latter also stretches out over enumeration and optimization of answer sets. 
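
The quadratic growth of pairwise distribution queues mentioned for the copying approach is easy to make tangible. The following Python sketch builds one queue per ordered pair of distinct threads and copies a nogood into every queue of a sender; it illustrates the counting argument only and is not a model of manysat's implementation, and all names are hypothetical.

```python
from collections import deque
from itertools import permutations
from typing import Deque, Dict, Sequence, Tuple

Queues = Dict[Tuple[int, int], Deque[Tuple[int, ...]]]


def make_queues(n: int) -> Queues:
    """One queue per ordered (sender, receiver) pair, i.e. n*(n-1) queues."""
    return {(s, r): deque() for s, r in permutations(range(n), 2)}


def broadcast(queues: Queues, sender: int, nogood: Sequence[int]) -> int:
    """Copy a nogood into each outgoing queue of `sender`; return the copy count."""
    copies = 0
    for (s, _receiver), queue in queues.items():
        if s == sender:
            queue.append(tuple(nogood))  # each receiver gets its own copy
            copies += 1
    return copies


print(len(make_queues(4)), len(make_queues(8)))   # 12 56
print(broadcast(make_queues(8), 0, [1, -2, 3]))   # 7
```

Doubling the number of threads from four to eight more than quadruples the number of queues, and every distributed nogood is copied n-1 times.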

8 Discussion

We have presented major design principles and key implementation techniques underlying 
the clasp 2 series, thus providing the first CDCL-based ASP solver supporting paralleliza- 
tion via multi-threading. While its multi-threaded variant aims at leveraging the power 
of today's multi-core shared memory machines in parallel search, clasp 2 has also been 
designed with care not to sacrifice the (low-level) performance of its single-threaded variant,
sharing a common code base. In fact, the competitiveness of single- as well as multi-threaded
clasp 2 variants is, for instance, witnessed by their performances at the 2011 SAT
competition. Beyond powerful parallel search, multi-threaded clasp 2 allows for conduct- 
ing the various reasoning modes of its single-threaded sibling, including enumeration and 
(hierarchical) optimization, in parallel. On the one hand, this makes the multi-threaded 
variant of clasp 2 highly flexible, offering parallel solving capacities for various reasoning 
tasks. On the other hand, the vast configuration space of a CDCL-based solver becomes 
even more complex, as individual threads as well as their interaction can be configured 
in manifold ways. In view of this, adaptive solving strategies (e.g. regarding nogood exchange)
and automatic parallel solver configuration are important issues for future work.